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Abstract. This paper introduces our dedicated authenticated encryption scheme ICEPOLE. ICE- 
POLE is a high-speed hardware-oriented scheme, suitable for high-throughput network nodes or 
generally any environment where specialized hardware (such as FPGAs or ASICs) can be used to 
provide high data processing rates. ICEPOLE-128 (the primary ICEPOLE variant) is very fast. 
On the modern FPGA device Virtex 6, a basic iterative architecture of ICEPOLE reaches 41 
Gbits/s, which is over 10 times faster than the equivalent implementation of AES-128-GCM. The 
throughput-to-area ratio is also substantially better when compared to AES-128-GCM. We have 
carefully examined the security of the algorithm through a range of cryptanalytic techniques and 
our findings indicate that ICEPOLE offers high security level. 
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1 Introduction 

Protocols such as SSL/TLS [12, 16], the backbone of the Internet, are designed to provide data 
confidentiality and authenticity. Often the underlying algorithms of these protocols realize en- 
cryption and authentication separately (e.g., AES in CBC mode for encryption and HMAC- 
SHA1 for authentication). Although this approach leads to relatively easy security analysis, the 
performance, due to two separate algorithms, does not meet demands of modern applications. 
Hence, recently, the symmetric crypto community focuses its attention on dedicated authenti- 
cation encryption schemes, which aim to provide message encryption and authentication more 
efficiently. An interest in new efficient and secure solutions is manifested in the recently launched 
competition called CAESAR [1]. 

This paper presents our new design of the authenticated encryption scheme called ICEPOLE. 
It is a family of authenticated ciphers with three parameters: key length (128 or 256 bits), nonce 
length (between 0 and 128 bits), secret message number (between 0 and 128 bits). Our primary 



recommendation is ICEPOLE-128 which uses 128-bit key, 128-bit nonce, and 128-bit secret 
message number. A secret message number parameter is a new idea (proposed by the CEASAR 
competiton committee), which should add more flexibility and security especially for multiple- 
message network protocols. 

The claimed security level for the primary variant is 128 bits and it is supported by our 
extensive crypt analysis. ICEPOLE is based on the duplex framework introduced by Bertoni et 
al. in [8]. At the heart of the duplex framework is a permutation and the ICEPOLE permutation 
is our new design. In particular, inspired by the Keccak non-linear step [7], we introduce a new 
S-box with good security properties and low implementation cost. 

ICEPOLE is a high-speed hardware-oriented scheme, suitable for high-throughput network 
nodes or more generally any environment where specialized hardware (such as FPGAs or ASICs) 
can be used to provide high data processing rates. ICEPOLE-128 is very fast. On the modern 
FPGA device Virtex 6, a basic iterative architecture of ICEPOLE reaches 41 Gbits/s, which 
is over 10 times faster than the equivalent implementation of AES-128-GCM [23] (one of the 
most common standards for authenticated encryption). The throughput-to-area ratio is also 
substantially better than AES-128-GCM results. 

2 Specification 

2.1 Parameters 

ICEPOLE is a family of authenticated ciphers with three parameters: key length, secret message 
number length, nonce length. The key length is either 128 bits or 256 bits. The secret message 
number length is between 0 and 128 bits. The nonce length is between is between 0 and 128 
bits. 

2.2 Recommended Parameter Set 

Our primary recommended parameter set is: 128-bit key, 128-bit secret message number, 128-bit 
nonce. The ICEPOLE variant with these recommended parameter values is called ICEPOLE- 
128. We also define two ICEPOLE variants serving as drop-in replacements for AES-128-GCM 
and AES-256-GCM. These variants are ICEPOLE-128a (128-bit key, 0-bit secret message num- 
ber, 96-bit nonce) and ICEPOLE-256a (256-bit key, 0-bit secret message number, 96-bit nonce). 

The following specification refers to the primary recommendation ICEPOLE-128. A specifi- 
cation of ICEPOLE-128a and ICEPOLE-256a is nearly the same and the differences are given 
at the end of this section. 

2.3 State Organization and Notations 

The algorithm works on the 1280-bit state S. The state S is organized as the two-dimensional 
array S^]^] where each element of the array is a 64-bit word. The little-endian convention is 
used throughout the document. When we refer to the particular bit, we introduce the third index: 
S^x]^]^]. The mapping between the bits of vector v and those of S'fxjfy]^] is v[64(x + 4:y) + z\ = 
S[x][y][z]. If the bits of the state share the same z coordinates, they form a slice. As z ranges 
from 0 to 63, there are 64 slices in the state. If the bits of the state share the same x and z 
coordinates, they form a row. It is also convenient to introduce a notation which allows referring 
to the first n bits of the state. Let S^ n j denotes the first n bits of the state, namely those bits 
S'[a;][y][z] for which 64 (x + 4y) + z < n. 

We use the following notation: © (bitwise XOR), • (bitwise AND), -> (negation). 



2.4 Scheme Overview 



ICEPOLE-128 encrypts and authenticates a message with a 128-bit key and a 128-bit nonce. 
There are 3 phases of the algorithm as shown in Figure 1. 



Fig. 1. General scheme of ICEPOLE encryption and authentication. 
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At the heart of ICEPOLE there is the 1280-bit permutation denoted by P. Let us first 
describe this permutation. 



2.5 Permutation P 

P is an iterated permutation and a number of rounds is a parameter of the permutation. In the 
presented algorithm the 6- and 12-round variants (denoted by P§ and P12) are used. Each round 
R consists of five steps labelled by the Greek letters: \i (mu), p (rho), ir (pi), tp (psi), k (kappa). 



P = K0^07TOpo / U 



Each step updates the state as follows. 



In the [i step bits are mixed through the MDS (Maximum Distance Separable) matrix. Every 
20-bit slice is mixed through the matrix given below. Formally, a column vector (Zq, Z±, Z2, Z$) 
is multiplied by a constant matrix producing a vector of four 5-bit words. 
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The operations are done in GP(2 5 ). Here the multiplication is defined as the multiplication of 
binary polynomials modulo the irreducible polynomial x 5 + x 2 + 1 . There are only three distinct 
terms in the chosen matrix, namely 18, 2, 1 and they correspond to the polynomials x 4 + x, 
x, and 1, respectively. The fx step can be efficiently implemented with simple bitwise equations 
(see Appendix B). 



The p step is the bitwise rotation applied to each of the twenty 64-bit words of the state. The 
bitwise rotation moves bit at position z into position (z+r va i ue ) modulo 64. For each word r va i ue 
is different. 

S[x][y] := S[x][y] <g; offsets [x] [y] for all (0 < x < 3), (0 < y < 4) 



The rotation offsets are as follows. 
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7r reorders the words in the state. Words are moved from <S[x][y] to iSfx'Jh/] and the new coor- 
dinates (x',y') are calculated from the following simple formula. 

x' := (x + y) mod 4 

y' := (((x + y) mod 4) + y + 1) mod 5 



In the ip step the ICEPOLE S-box is applied to each of 256 rows of the state. The S-box maps 
a 5-bit input vector (Mo, M±, M4) to a 5-bit output vector (Zo, Z\, Z4). The S-box func- 
tionality can be easily described by the following bitwise equation. Operations on the index k 
are done modulo 5. The bitwise AND operator • is omitted for clarity. 

for all (0 < k < 4) 

z k = M k e {^M k+l M k+2 ) e (M0M1M2M3M4) e (^m 0 ^m 1 ^m 2 ^m 3 ^m 4 ) 
k: 

In k the 64-bit constant is xored with 5[0][0]. 

5[0][0] := 5[0][0] 0 constant [number OfRound] 

The constant values are taken as the output of a simple 64-bit maximum-cycle Linear Feedback 
Shift Register (LFSR). The polynomial representation of LFSR is x 64 + x 63 + x 61 + x 60 + 1. The 
LFSR state is initialized with the 64-bit vector '0123456789ABCDEF' (hexadecimal format) and 
then each cycle generates a subsequent constant. For implementors convenience the constant 
values are given in Appendix A. 



2.6 Initialization Phase 

First, the state is initialized with the 1280-bit pseudorandom constant. The constant was ob- 
tained by applying the Keccak-f[1600] permutation (an underlying permutation of the SHA-3 
standard) to the all-zero vector and truncating the result to 1280 bits. The state is initialized 
as follows. The values are given in hexadecimal using the little-endian format. 



S[0][0] := FF97A42D7F8E6FD4 
S[0][2] := 8C5BDA0CD6192E76 
S[0][4] := 30935AB7D08FFC64 
S[l][l] := A9A6B6260D712103 
5[1][3] := 435831OD0347C826 
5[2][0] := 05E5635A21D9AE61 
S[2][2] := 613670957SC46611 
S[2][4] := 8C3EE88A1CCF32C8 
S[3][l] := 1841^924^12(7509^4 
S[3][3] := 75F644.B97F30A13B 



5[0][1] := 90^^^5^0^44647(74 
5[0][3] := AD30A6F71B19059C 
5[1][0] := ££5A493F2317L>635 
5[1][2] := 81A57C16DBCF555F 
5[1][4] := 01F22FL41L45569F 
5[2][1] := 6ABEFEF28CC970F2 
S[2}[3] := J B87<75A554FL>00£C , 5 
5[3][0] := 940C7922AE3A2614 
5[3][2] := 16F53526£70465<72 
5[3][4] := ^F1FF7B5C£?CA249 



Once the state is filled with the constant, the 128-bit key K and the 128-bit nonce are introduced 
into the state. Kq and K\ denote two 64-bit words of the key, nonceo and noncei denote two 
64-bit words of the nonce. 



5[0][0] :=5[0][0]e^o 

5[1][0] :=5[1][0]©^! 

S[2][0} := 5 [2] [0] ©nonceo 

S[3][0] := 5 [3] [0] ©noncei 

Then, the P\i permutation is run on the state S. 
S := P 12 (S) 



2.7 Processing Phase 

The input data is processed in blocks. First, a single secret message number block a SMN is 
processed, then associated data blocks af D and next plaintext blocks af. a SMN and af are 
authenticated and encrypted whereas the associated data blocks af D are only authenticated. 

The a SMN block length is 128 bits. af D and af block length has to be between 0 (the empty 
block) and 1024 bits. Each block is padded to be 1026 bits long and the padding rules are as 
follows. First, every block is appended with the frame bit. The frame bit is set to 1 for the last 
a AD D i oc k anc j a \\ g-P exce pt t ne i as t one _ p or all other blocks the frame bit is set to 0. Once 
the frame bit is appended, a given block is padded with a simple rule: append 1 and such a 
number of 0's which gives 1026-bit block. Thus the padded block has at least two padding bits 
(the frame bit and 1) and maximally 1026 padding bits (in case of the empty block). 

In the processing phase the ciphertext blocks q are produced and the state is updated. 

csmn = S [128l © a SMN 

a SMN . = pad{a SM N) 

*S'L1026J : = <%026J © a SMN 
S := P 6 (S) 

for all blocks af D { 
af D := pad(af D ) 
<%026J := 5 , Ll0 26j (Baf- D 
S := P 6 (S) 

} 



for all blocks af { 

Ci = Syj © 0~f (I is a length of erf) 
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2.8 Tag Generation 

When the blocks processing is finished, the 128-bit authentication tag T is derived. (To and T\ 
denote two 64-bit words of T.) 

T o :=S[0][0] 
Ti:=5[l][0] 



2.9 Decryption and Verification 

Decryption and verification are done basically with the same scheme as for encryption. The only 
difference is that now the input data are the ciphertext blocks and the associated data blocks. 
Figure 2 shows the scheme. We stress that the same permutation P (and not its inverse) is used 
for decryption and verification. Once the processing phase is finished, the tag T is generated 
and verified with the tag received from the sender. If the tags match, the data is authenticated. 
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2.10 ICEPOLE-128a 

We specify ICEPOLE-128a to have a drop-in replacement for AES-128-GCM run with most 
common parameters, namely a 96-bit nonce and a 128-bit tag. 

The only differences between ICEPOLE-128 (specified above) and ICEPOLE-128a is that in 
ICEPOLE-128a there is not a secret message number (the a SMN block is empty) and a nonce is 
96 bits long. The nonce is padded with 32 zeros and introduced into the state in the same way 
as for ICEPOLE-128. 



2.11 ICEPOLE-256a 



We specify ICEPOLE-256a to have a drop-in replacement for AES-256-GCM run with most 
common parameters, namely a 96-bit nonce and a 128-bit tag. 



ICEPOLE-256a encrypts data with a 256-bit key, a 96-bit nonce and the data is authenti- 
cated with a 128-bit tag. There is not a secret message number and a 96-bit nonce is padded with 
32 zeros. A 256-bit key consists of four 64-bit words Kq . . . K3 and the padded nonce consists of 
two 64-bit words. The key and the nonce are introduced into the state as follows. 
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The associated data blocks af D and the plaintext blocks af have the length between 0 
and 960 bits. The padded blocks are 962 bits long. For ICEPOLE-256a the number of blocks 
encrypted under a single key should be less than 2 62 . All other parameters and steps of the 
specification are the same as for ICEPOLE-128. 

3 Security Goals 

In Table 1 we quantify the intended number of bits of security (i.e., the logarithm base 2 of the 
attack cost) in each of the given categories. 



Table 1. The intended number of bits of security 



Goal 


ICEPOLE-128 


ICEPOLE-128a 


ICEPOLE-256a 


confidentiality for the plaintext 


128 


128 


256 


confidentiality for the secret message number 


128 


128 


256 


integrity for the plaintext 


128 


128 


128 


integrity for the associated data 


128 


128 


128 


integrity for the secret message number 


128 






integrity for the public message number 


128 


128 


128 



A public message number is a nonce. For ICEPOLE-128a and ICEPOLE-256a there is no 
secret message number. For 128-bit key ICEPOLE variants the number of blocks encrypted 
under a single key should less than 2 126 . For 256-bit key variants the limit is 2 62 blocks. 

A user of ICEPOLE is required to use a nonce. However, in the case of nonce reuse ICE- 
POLE provides some intermediate level of robustness. The nonce reuse leads to leaking XOR 
of plaintexts (via XOR of ciphertexts) and this cannot be avoided in our case. The secret key 
used with the same nonce also leads to the situation where the adversary can control the XOR 
differences after the 12-round initialization. But our cryptanalysis shows (see Section 4) that the 
6-round permutation P (run at the processing phase) is secure, in particular against differential 
and linear cryptanalysis. Therefore we argue that in the case of nonce reuse the key-recovery 
attack is not possible. Please note, that for ICEPOLE-128, if a public message number is reused 
but a secret message number changes full confidentiality on plaintexts is preserved and the 
key-recovery attack is not possible. Only XOR of secret message numbers leaks (via XOR of 
ciphertexts). 

4 Security Analysis 

In 2010, Bertoni et al. introduced the duplex construction [8], which provides the framework for 
an authenticated encryption scheme. ICEPOLE is based on this construction and thus ICEPOLE 



general security claims are inherited from it. The duplex construction can be seen as a particular 
way of using the sponge construction [6]. Similarly, there are two parameters, namely r (bitrate) 
and c (capacity). The sum of these two parameters makes the state size. Different values for 
bitrate and capacity give trade-offs between speed and security; a higher bitrate gives a faster 
construction at the expense of a lower security. For ICEPOLE-128, r is 1026 bits and c is 254 
bits. In case of ICEPOLE variants with a 256-bit key, r is 962 bits and c is 318 bits. 

In [3], it was proved that the sponge construction is secure against generic attacks with 
complexity below 2 C//2 . However, when a sponge or duplex object is used in conjunction with a 
secret key, one can prove more refined bounds taking into account the data complexity. In [4] 
Bertoni et al. proved that if the data complexity is limited to 2 a r-bit blocks, the keyed mode 
withstands generic attacks with time complexity up to 2 c ~ a calls of the underlying permutation. 
If a < c/2, this results in an increase of the security strength from c/2 to c — a. This comes 
in handy particularly for 256-key ICEPOLE variants, where we would like to keep 256 bits 
of security without expanding the state or introducing any serious changes in the algorithm's 
specification. By limiting the number of blocks (encrypted under the same key) to 2 62 , ICEPOLE- 
256 stands up to any attack up to 2 318 ^ 62 = 2 256 (unless easier generically) . For ICEPOLE-128 
the limit (rather purely theoretical) is 2 126 blocks and hence the security level is 254 — 126 = 128 
bits. 

We claim that the security level can be proven under the assumption that the underlying 
permutation P has not any exploitable properties (there are no structural distinguishers of 
the permutation). Therefore the security analysis of ICEPOLE comes down to analysis of the 
permutation P. Below we give our cryptanalysis indicating that P is indeed a secure permutation. 

A user of ICEPOLE is required to use a nonce. However, in the case of nonce reuse ICEPOLE 
provides some intermediate level of robustness. The nonce reuse leads to leaking XOR of plain- 
texts (via XOR of ciphertexts) and this cannot be avoided in our case. The secret key used with 
the same nonce also leads to the situation where the adversary can control the XOR differences 
after the 12-round initialization. But our cryptanalysis (given below) strongly indicates that the 
6-round permutation P (run at the processing phase) is secure, in particular against differential 
and linear cryptanalysis. Therefore we argue that in the case of nonce reuse the key-recovery 
attack is not possible. Authenticity (integrity) in the duplex construction does not need a nonce 
requirement, thus is not threatened by nonce reuse [8]. 

4.1 Differential Cryptanalysis 

Differential cryptanalysis, introduced by Biham and Shamir [9], has become very powerful tech- 
nique of modern cryptanalysis. One of the most convincing way of showing the resistance against 
differential attacks is to provide a lower bound on the weight of any differential characteristics 
(also called differential trails or paths) over a number of rounds. For example, in the AES the 
structure of the cipher and its diffusion properties allow to provide such bounds analytically [11]. 
However for 'bit-oriented constructions (e.g., Keccak or MD6 hash function) it is not possible 
to derive differential characteristics bounds in a very straightforward and convenient manner. 
In such cases computer-aided proofs are provided and for ICEPOLE we take this approach. 

Computer Aided Proof 

A brute-force strategy to check all possible characteristics (even for a very small number 
of rounds) fails. The 1280-bit state is too big, even when exploiting all possible symmetries. 
Instead of the plain brute-force we used a SAT-solver. A SAT solver is an algorithm, which 
decides whether a given propositional (Boolean) formula (typically described in the Conjunctive 
Normal Form) has a satisfying valuation. Generally, to solve a problem: (1) translate the problem 



to SAT (in such a way that a satisfying valuation represents a solution to the problem); (2) run 
a favourite SAT solver to find a solution. 

First, we focused on the problem of finding the 3-round characteristic with the minimum 
number of active S-boxes. The problem was encoded as a SAT formula in the Conjunctive Normal 
Form with the aid of the CryptLogVer toolkit [18]. CryptoMiniSat2 [25] is able to solve it in a 
few hours on a desktop PC. The solution, that is the minimum number of active S-boxes for 3 
rounds, is 9. Then, we tried to repeat the experiments for 4 rounds, but the problem was too 
hard for the SAT-solver. However, if we slightly change the problem and ask the solver about 
a particular number (up to 13) of active S-boxes on the 4-round differential path, the answer 
is provided by the solver. For 4 rounds there are no paths with 13 or fewer S-boxes. Again, 
checking a higher number of S-boxes turned out to be infeasible. 

If a number of active S-boxes is at least 14 (for 4 rounds) and the highest probability of a 
difference transition through the S-box is 2~ 2 (deduced from the difference distribution table 
of the S-box, given in Appendix C), then the lowest weig ht for 4 rounds is 2" 2 * 14 = 2" 28 . So 
for 12 rounds a weight equals 2 -28 * 3 = 2~ 84 and hence data complexity for the attack is 2 84 
plaintexts. Please note that this is already a much bigger number than the limitation (2 62 ) for 
ICEPOLE-256 on a number of blocks of plaintexts encrypted under the same key. 

We believe that the complexity of differential attack should be much higher than the lower 
bound of 2 84 plaintexts. The first reason for that is the difference transitions through the S-box 
with the weight 2~ 2 happen very rarely. Out of 337 possible difference transitions only 10 (3%) 
has the weight 2~ 2 . Most of the transitions (216) has the weight 2 -4 and the average weight is 
2 3 ' 4 . The second argument strongly indicating that ICEPOLE is resistant to a differential attack 
is our experimental results. These gives some more insight into the difference propagation in 
ICEPOLE. 

Differential Path Search 

We were inspired by the work of Due et al. [15] where the algorithm for differential path search 
was given for the Keccak permutation. They managed to provide the best differential paths for 
the round-reduced variants of the permutation. Our permutation shares some key features with 
the Keccak permutation (in particular how the state is organized and 'bit-oriented' propagation) 
and hence we think that a similar algorithm may be fruitful also for our analysis. 

The goal of the algorithm is to derive differential paths by maintaining the bit difference 
Hamming weight as low as possible. We note that p, p, tt are all linear mappings (denoted 
altogether by A, while ip acts as the non-linear S-box. k (adding round constants) does not affect 
differential analysis in any way. Furthermore, p and tt do not change the number of active bits 
in a differential path, but change only bit positions. Hence, p and ip are critical when analysing 
differential paths. Since ip is followed by p in the next round (ignoring «), we consider these two 
mappings together by treating a slice of the state as a unit, and try to find the potential best 
mapping of the slice through ip with the following rule. 

• Given an input difference of the slice find all possible output differences by looking into 
the S-box differential profile. Then, among all combinations of possible output differences, 
choose a combination which would give the state the minimum Hamming weight after an 
application of p. 

It is not possible to check all possible states as the starting points for a differential path 
because the state is too big, even if we take advantage of symmetries. We limit the space of 
starting points to the states with a single active bit. For the 20-bit slice there are 20 such cases. 
For our permutation (as in the case of the Keccak permutation) a differential path is invariant 
through position rotation along the z axis so choosing a particular slice does not matter. 



We start our search from b\ point, i.e., the state after the linear mappings (denoted by A) in 
the second round, and compute backwards for one round, and a few rounds forwards, as shown 
below. 



oo < oo < ai <; bi — ^ «2 — > o-2 — > as — > 03 . . . 

The forward part is longer than the backward part because the diffusion of ^T 1 is better 
than for /x, so it will be easier to control the bit differences Hamming weight for several rounds 
forwards (instead of backwards). Table 2 shows probabilities of the best paths we found with 
the aid of the algorithm. 

Table 2. Best differential paths results. The third column shows the weights of rounds for a given path. 



rounds 


total probability 


products 


1 


2" a 


2~ 2 






2 


2 -io 


2~ 8 • 2~ 2 






3 


2-18.4 


2 -8.4. 2 -2. 


2- 8 




4 


2-52.8 


2 — 8.8 2 — 2 


2 -8 . 2 -34 




5 


2-186.2 


2-10.4 . 2 -2 


. 2 -8 . 2 -36 


2-129.8 


6 


2~555.3 


2-10.4.2-2 


■ 2~ 8 • 2" 36 


2-129.8 . 2~369 



The weight of the 3-round path matches the bound we provided (2~ 18 ) very closely. We in- 
vestigated up to 6 rounds as the complexity of the attack exploiting the 5-round path is already 
completely intractable. 

Internal differentials The best collision attack against Keccak was obtained through the 
technique called internal differentials [13]. While in standard differential attacks we consider 
two different plaintexts, in internal differential attacks only one plaintext is considered, and 
the statistical evolution of the differences between its parts is followed. In the attack against 
Keccak two properties were exploited, which led to the successful attack against the round- 
reduced Keccak. These properties are: very low Hamming weight constants (which helps to keep 
the state in the desired symmetry) and the fact that the state is initialized with the all-zero 
vector (which allows to construct the initial difference). For ICEPOLE it is not the case as the 
state is initialized with the pseudorandom constant and the round constants have much higher 
Hamming weight. Therefore we conclude it is not possible (or heavily limited) to successfully 
apply the internal differential technique to our scheme. 

4.2 Linear Cryptanalysis 

Linear cryptanalysis, formally introduced by Matsui [21], has become another powerful tool 
against modern cryptographic primitives. The main idea is to construct the linear approximation 
of the algorithm. In many ways this technique resembles differential cryptanalysis. Tracing the 
evolution of differences has the counterpart in tracing linear masks. Usually the complexity of 
the attack is also determined by the number of active S-boxes in the trail. One excellent example 
of exploiting the duality between these two techniques is the analysis of AES provided by its 
designers. 



Although the structure of ICEPOLE does not allow for a straightforward and completely 
parallel analysis with respect to the two types of attacks, we think that ICEPOLE (and its 
permutation P) should offer very similar security margin against linear and differential crypt- 
analysis. First indication of it is the examination of the linear profile of the S-box. (The complete 
profile is given in Appendix D.) The highest bias of the linear approximation of the S-box is 
2 -2 and on average the bias is lower as the value of 2 -2 happens rarely. (Note that the highest 
probability of difference transitions in the S-box is also 2~ 2 ). The complexity of the linear attack 
is not only determined by the S-box properties but also by a number of active S-boxes on the 
trail. The [i step brings diffusion to the algorithm and hence it is the main factor for increasing 
a number of active S-boxes. The fj, step affects linear and differential trails in the same way. 
Therefore we conclude that the complexity of the linear attack against ICEPOLE should be 
comparable with differential analysis and after 5-6 rounds the complexity becomes completely 
intractable. 

4.3 SAT-based (Logic) Cryptanalysis 

We encoded the following problem into SAT: an adversary knows a part of the input state, a 
part of the output state and the goal is to retrieve the unknown part of the input state. For 
ICEPOLE this problem models two types of attacks. The first type is the key recovery where 
an unknown part of the state is a secret key. The second type of the attack is the state recovery 
(in the processing phase) where the attacker tries to recover the unknown capacity part of the 
state. 

To encode the problem into a SAT instance we used the toolkit presented in [18]. We obtained 
a SAT instance describing a single round of P with roughly 6400 variables and 35300 clauses. 
In attacks we used CryptoMiniSAT2, a gold medallist from recent SAT competitions [25]. We 
tried to solve three variants of the problem where 64-, 80-, and 128-bit part of the input state 
remains unknown. For 2 rounds CryptoMiniSAT2 was able to find the solution in a few seconds 
on a desktop PC. For 3 rounds only 64-bit variant of the problem was solved (also in a matter 
of seconds) and for 4 rounds, with 48-hour time limit, CryptoMiniSAT2 was unable to provide 
any solution. It looks as the hardness of the problem grows super exponentially in a number 
of rounds and this effect has been also observed in SAT-based attacks on other cryptographic 
primitives [18,24]. Thus we conclude that ICEPOLE with the 12-round initialization and the 
6-round processing phase is secure against the SAT-based attack. 

4.4 Rotational Cryptanalysis 

The technique was formally introduced in [20]. Unlike differential analysis, where the attacker 
follows the propagation of the xor differences of two plaintexts through the cryptographic system, 
in rotational analysis, the adversary investigates the propagation of the rotational relations 
between plaintexts. In [22] rotational cryptanalysis was applied to Keccak and since there are 
some similarities between ICEPOLE and Keccak, we take a closer look whether that technique 
could be used against ICEPOLE. In the attack on Keccak two properties were exploited, namely 
very low Hamming weight constants (which helps keep the states in the desired rotational 
relation) and the fact that the state is initialized with the all-zero vector (which allows to 
construct the initial rotational relation). For ICEPOLE it is not the case as the state is initialized 
with the pseudorandom constant and the round constants have much higher Hamming weight. 
Therefore we conclude it is not possible (or heavily limited) to apply rotational cryptanalysis to 
our scheme. 



4.5 Techniques Exploiting Low Algebraic Degree 



There are several cryptanalytic techniques, which exploit a low algebraic degree. These are, for 
example, the cube attack [14] or the zero-sum distinguisher [2]. An algebraic degree of a single 
round of P (or its inverse) is 4. Then, after four rounds the algebraic degree is 256, which stops 
the mentioned attacks from reaching the attack complexity lower than the claimed security level 
2 128 . Thus ICEPOLE with its 12-round initialization is completely secure against techniques 
exploiting a low algebraic degree. 

5 Features 

ICEPOLE is designed for high-throughput network nodes or any environment where specialized 
hardware (such as FPGAs or ASICs) can be utilized to provide desired high data processing 
rates. As the duplex-based construction it inherits useful features: 

- it supports associated data processing 

- it is an online cipher (ciphertext block a can be generated without knowing the plaintext 
block (Jj+i) 

- it can provide intermediate tags 

- encryption is not expanding 

Our new permutation P lets ICEPOLE works at the very high speed on the modern FPGA 
devices. Also our basic, non-optimized software implementation results are promising. Finally, 
we show how to process in parallel with ICEPOLE. 

5.1 Hardware Performance 

A proof-of-concept basic iterative architecture of ICEPOLE-128 was implemented. Figure 3 
shows an overview of a datapath design. The presented cryptographic core is capable of per- 
forming encryption and decryption, and contains a full padding unit. 



Fig. 3. A proof-of-concept single iterative round design for the hardware implementation of ICEPOLE 
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AES-GCM is used as a basis of our comparison as it is one of the most widely accepted 
standards for authenticated encryption [23]. The same basic iterative architecture is implemented 
for a direct comparison. Both implementations use also the same interface and communication 
protocol in order to reduce any discrepancies between the two designs. Similar to ICEPOLE, 
AES-GCM contains the full padding unit and supports both encryption and decryption within 
a single core. 



Both cryptographic cores were described using VHDL language and verified against software 
generated test vectors using ModelSim. The results were generated using ATHENa [17] using 
two high-performance FPGA families from two major FPGA vendors, Xilinx and Altera. These 
FPGA families are Xilinx Virtex 6 and Altera Stratix IV, respectively. No dedicated resources, 
such as Block RAMs or DSP units, were used in either implementation. The comparison between 
ICEPOLE-128 and AES-128-GCM using a basic iterative architecture is shown in Table 3. The 
throughput shown in the table is based on the throughput of long messages. 

Table 3. The comparison between ICEPOLE-128 and AES-128-GCM using an iterative architecture 
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With the exception of resource utilization, ICEPOLE-128 consistently outperforms AES- 
128-GCM in terms of the throughput and the throughput-to-area ratio. For Xilinx Virtex 6, 
with only 60% increases in area, ICEPOLE-128 achieves almost 12 times the speed of AES- 
128-GCM, and seven times higher the throughput-to-area ratio. For Altera Stratix IV, due to 
the unique behaviour of Altera Adaptive Look Up Tables (ALUTs), the resource utilization is 
similar for both algorithms, with ICEPOLE-128 consuming only 13% more area. At the same 
time, ICEPOLE-128 outperforms AES-128-GCM by a factor of 11 in terms of throughput and 
a factor of 9 in terms of the throughput-to-area ratio. 

5.2 Software Performance 

While the primary focus of the ICEPOLE design is hardware performance, the cipher is also 
amenable to efficient software implementations. The three steps that require nontrivial imple- 
mentations are p, p and ip. They all can be easily implemented on platforms supporting 64-bit 
XORs, logical ANDs and rotations. We measured that a rather straightforward C implemen- 
tation compiled for speed (with no beyond-C optimization efforts like code vectorization using 
AVX or intrinsics use) runs for very long messages at about 9 cycles per byte on Intel Ivy Bridge 
i5-3320M processor. The same implementation runs at about 8 cpb on a Haswell (Intel Xeon E3 
1275) machine. 

We believe there is still room for possible improvements. A better code optimization (e.g., mak- 
ing sure that the compiler uses the andn a, b instruction on Haswell for —>A- B extensively used 
in the step ip) could lead to a better performance than the reported 8 cycles per byte. Addition- 
ally, one could think about an AVX-based implementation where the whole state is kept in five 
YMM registers. Compared to the pure C code this could save time on memory loads and stores 
but at the expense of the more complex [i step. For 32-bit platforms, only rotation performance 
will scale worse than linearly (compared to the straightforward 64-bit version). It is because in 
such a case the rotations need to be combined from more than two instructions. 

5.3 Parallel Processing 

To be competitive with some block cipher based modes like OCB (Offset CodeBook) or GCM 
(Galois Counter Mode), the AEAD scheme should allow parallel processing. 



Fig. 4. ICEPOLE used in the parallel mode. For clarity of the figure only plaintext blocks are shown, but the 
parallel mode supports associated data as well. 




However, the scheme based on the duplex construction can no be paralellized at the algo- 
rithmic level. In [8] the authors just briefly state that tree-like processing (described in detail 
for hashing, see Section 3.3.2 in [6]) could be used also for the duplex-based AEAD providing 
parallel streams with some overhead. Below we describe our simpler scheme where ICEPOLE is 
used to provide parallel processing. 

As shown in Figure 4, blocks are processed in separate streams. Each stream has its own 
initialization phase that is the P\2 call with a key, a nonce and a counter as an input. Then, 
each stream of blocks is processed in parallel. Each stream is ended with generating the 128-bit 
tag ti, which are eventually combined with the XOR operation to produce the final tag T. 

To simplify security analysis we let an attacker know intermediate tags to,t±, . . . ,t n . This 
way we can analyze every stream separately. And we already discussed the security analysis of 
a single stream (which is ICEPOLE in normal, sequential mode) and concluded it is secure. 
Then the only question remains whether the final XOR operation (producing the tag T) can be 
somehow exploited by the attacker. As to, t±, . . . , t n are already known, the key recovery attack 
(recovery of the state) starting from T does not make sense, it would only make the task harder. 
A tag forgery would mean that the attacker can control the xor differences between to, t\, . . . , t n . 
But this would contradict the security of a single stream as we showed that after the initialization 
phase an attacker has no control of differences. (See Section 4.1.) Thus we conclude that the 
parallel processing we propose for ICEPOLE is secure. 



6 Design Rationale 



We have aimed at high-speed, hardware-oriented authenticated encryption scheme, suitable for 
high-throughput network nodes or any environment where specialized hardware (such as FPGAs 
or ASICs) can be utilized to provide desired high data processing rates. Our main inspiration 
comes from the duplex construction with the round-reduced Keccak-f permutation [5]. We have 
decided to keep the general framework (namely the duplex construction) and design an under- 
lying permutation from scratch. 

In the Keccak-f permutation, the linear step 9 brings most of diffusion to the algorithm and 
is roughly two times slower (when considering the FPGA design) than the non-linear part (a 
layer of S-boxes). Our general approach to the design of the permutation P has been to make 
steps more balanced. We have wanted to make the linear part simpler (faster) and improve the 
properties (diffusion, algebraic degree) of the non-linear part (in comparison to Keccak-f). The 
challenge has been that a more complex S-box layer should not nullify the gain from introducing 
a lighter linear part. Our second starting idea was to take into consideration cryptanalysis of 
Keccak-f and its round-reduced variants. Those findings have determined some of our decisions 
for ICEPOLE and its P permutation. 

6.1 Permutation P Steps 

Let us first explain design rationale behind the P permutation steps. 

We have aimed at a possibly simple and implementation- friendly linear step. The step does not 
have to have an efficient inverse as in the duplex construction the permutation is calculated only 
in one way (both for encryption and decryption). Additionally, we have required that the linear 
step has very good diffusion properties. In [19] Junod and Vaudenay presented their research 
on building MDS matrix (known for an excellent diffusion property) under the criteria, which 
perfectly suit our needs, that is an efficient implementation and neglecting an inverse of the 
matrix. We have decided to use one of the 'optimal' matrices presented in [19]. 

The j! step also helped determine the size and organization of the state. First, we tried to 
keep the same state organization as in Keccak-f[1600] that is 5 x 5 x 64. However the 'optimal' 
5x5 MDS matrix did not give us a clear advantage (in terms of hardware implementation 
efficiency) over the linear step presented in Keccak. Hence we have decided to use the smaller 
4x4 matrix, which would operate on four 5-bit vectors. The linear operation based on the 
chosen matrix can be implemented in just a single layer of LUTs in the modern FPGA devices. 
Consequently, to let our new linear step be applied naturally, the state has been organized as 
the two dimensional array 4 x 5 of 64-bit words, giving the 1280-bit state. 

P- 

The p step is essential to bring diffusion along z axis in the state. Otherwise a given bit would only 
affect bits from its slice (the bits sharing the same z coordinate). The 20 offsets are calculated 
from a simple formula i(i + l)/2 modulo the word length (64 bits in the case of ICEPOLE). 
This formula is the same as in the Keccak permutation. A nice feature of this formula is that 
in the case of shorter word lengths, each word (or nearly each word) has a distinct offset value. 
This might come in handy when one tries to build ICEPOLE variant with a smaller state, better 
suited for constrained environments. 



7r: 

The tt step reorders the words in the state. We have introduced this step to bring extra diffusion 
between the words (which is already provided by p and ■0). In hardware, tt and p are 'cheap', 
their computational cost corresponds to wiring. The tt formula has been chosen for its simplicity. 

We have aimed at the non- linear step (an S-box), which would have the following properties: good 
differential and linear profiles, an algebraic degree higher than 3, compact boolean circuit (low 
implementation cost). We have concluded that the Keccak S-box would not be the best choice 
mainly for its slow diffusion. Every bit affects only 3 others whereas we need better diffusion to 
complement \i. Secondly, the Keccak S-box algebraic degree is only 2 and for a small number 
of rounds techniques exploiting a low algebraic degree might be a threat. Therefore, ideally, 
we would like to keep good differential and linear profiles of the Keccak S-box and increase its 
diffusion and the algebraic degree. Our idea to achieve this goal was as follows. If we change 
the truth table of the Keccak S-box very little, differential and linear profiles should stay much 
the same (Though it needs to be verified.) Hopefully, a small change would improve diffusion 
and increase the algebraic degree. In the Keccak S-box the input vector '00000' is mapped onto 
'00000' and '11111' onto '11111'. If we switch them ('00000' '11111' and '11111' '00000'), 
this seemingly tiny change gives us all we want. Now every output bit depends on all 5 input 
bits (better diffusion then) and the algebraic degree now equals 4, also the inverse of the new 
S-box has degree equals 4. The boolean description is still very compact, the equations are given 
in Section 2. As expected the differential and linear profiles remain very the same, keeping their 
good properties. The profiles are given in Appendix. 

k: 

k adds the 64-bit round constant and for each round a constant is different. Without k all rounds 
of the permutation P would be equal making it subject to attacks exploiting symmetry such as 
slide attacks [10]. The constants used in Keccak have very low Hamming weight and this feature 
was exploited in two cryptanalytic attacks [13, 22] against the round-reduced Keccak. These 
results motivated us to introduce constants with much higher Hamming weight. The constant 
values are taken as the output of a simple 64-bit maximum-cycle Linear Feedback Shift Register 
(LFSR). The polynomial representation of LFSR is x 64 + x 63 + x 61 + x 60 + 1. The LFSR state 
is initialized with the 64-bit vector '0123456789ABCDEF' (hexadecimal format) and then each 
cycle generates a subsequent constant. Thus n can be implemented as a simple LFSR circuit or 
a precomputed look-up table. 

Steps order within a round 

H is the step, which provides the best mixing between the unknown part of the state (a secret 
key K) and the remaining part of the state which would be known to the attacker. Hence we 
have placed \x as the first step in a round. The order of other steps is arbitrary. 

6.2 ICEPOLE Parameters and Decisions 

ICEPOLE works on the 1280-bit state and the reason for that is explained above in the subsection 
on the \i step. ICEPOLE-128 (our primary recommendation) uses the 1024-bit input data block 
to be more hardware friendly. This value is a power of 2 and allows a more natural I/O operation 
in hardware as opposed to the slightly bigger size of 1088 bits (which could be also a choice). 
With 6-round processing phase where a large amount of data must be transferred within a short 



period, a non-power-of-2 block size can introduce an inefficiency in data transmission when the 
I/O width is large. Furthermore, with the 1024-bit input data block, a hardware implementation 
can more efficiently uses its storage, which can be important where aiming for an extremely small 
design. 

Before P is applied, the state is initialized with the pseudorandom 1280-bit constant. This 
decision has been motivated by cryptanalysis on the round-reduced Keccak where two different 
techniques [13, 22] have exploited the fact the state is initialized with the all-zero maintaining 
many different symmetries. 

The number of rounds in Initialization is 12. This value is based on our differential crypt- 
analysis shown in Section 4. After Initialization an adversary should not have any control over 
the differences (when mounting the differential attack). The experiments indicate that 6 rounds 
are sufficient and we have doubled this value to get a solid security margin. 

The number of rounds in Processing Phase is 6. This value is based on our SAT-based 
cryptanalysis given in Section 4. We were able to recover a small unknown part of the state for 
3 rounds. To get a solid security margin we have doubled the number of rounds in Processing 
Phase to 6. 

The frame bit (introduced as a part of the padding) is needed for security analysis of the 
duplex construction working in the authenticated encryption mode [6, Section 4.1.5]. The chosen 
padding rule is the simplest sponge-compliant padding [6, Definition 2]. 

The designers have not hidden any weaknesses in this cipher. 
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Appendix 



A 



The round constants used in the n step are given below. The values are given in hexadecimal 
using the little-endian format. 



constant [0] 
constant [2] 
constant [4] 
constant [6] 
constant [8] 
constant [10 



= 0091A2B3C4D5E6F7 
= 002468ACF13579BD 
= 00091A2BFC4D5E6F 
= 0002468AFF13579B 
= 000091A2BFC4D5E6 
:= 00002468EFF13579 



constant [1] 
constant [3] 
constant [5] 
constant [7] 
constant [9] 
constant [11] : 



0048D159E26AF37B 
00123456F89ABCDE 
00048D15FE26AF37 
000123457F89ABCD 
000048D1DFE26AF3 
= 00001234F7F89ABC 
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The fi step changes the S state according to the following equations. 
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Table 4. Difference distribution table of the S-box. Input and output differences are given in the hexadecimal 
format. Each element of the table represents the number of occurrences of the corresponding output difference 
AOUT given the input difference AIN. For clarity '-' denotes 0. 
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Table 5. Linear profile of the S-box. Input and output masks are given in the hexadecimal format. Each element 
in the table represents the number of mismatches between the linear equation represented by the input mask 
IN and the linear equation represented by the output mask OUT. Dividing an element value by 16 gives the 
probability that the corresponding equations are not equal. 
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